Abstract

The Information Power Grid (IPG) concept developed by NASA aims to
provide a metacomputing platform for large-scale distributed
computations by hiding the intricacies of a highly heterogeneous
environment while maintaining adequate security. In this paper, we
propose a latency-tolerant partitioning scheme that dynamically
balances processor workloads on the IPG and minimizes data movement
and runtime communication. By simulating an unsteady adaptive mesh
application on a wide area network, we study the performance of our
load balancer under the Globus environment. The number of IPG nodes,
the number of processors per node, and the interconnect speeds are
parameterized to derive conditions under which the IPG would be
suitable for parallel distributed processing of such applications.
Experimental results demonstrate that effective solutions are achieved
when the IPG nodes are connected by a high-speed asynchronous
interconnection network.


1. Introduction

The Information Power Grid (IPG) infrastructure has been developed by
NASA and other collaborative partners to harness the power of
geographically distributed resources (computers, databases, and human
expertise) in order to solve large-scale computational problems.
Applications that would benefit from such an infrastructure include:

• Desktop coupling to remote supercomputers so as to provide access to
large databases and high-end graphics facilities [9].

• User access to sophisticated instruments through remote
supercomputer connections utilizing virtual reality techniques [8].

• Remote interactions with supercomputer simulations [10, 11].

Several attempts have recently been made to develop what are called
computational grid capabilities and/or implementations [14]. For
example, the Condor system [19] was developed to manage research
studies at workstations around the world. However, it did not
adequately address the security issues involved. Other grid-based
systems include Nimrod [1], NetSolve [4], NEOS [5], Legion [15], and
CAVERN [18]. The Globus Metacomputing Infrastructure Toolkit [13]
successfully provides a portable virtual machine environment. It
supports mechanisms for sharing remote resources, provides adequate
security, and allows MPI-based message passing. Due to its portable
and modular nature, Globus has been chosen by NASA as the middleware
to implement the IPG.

So far, limited studies have been performed to determine the viability
of parallel distributed computing on the IPG. In [2], latency
tolerance and load balancing modifications were implemented for a CFD
application to compensate for slower communication speed. Results
showed that the application ran faster under Globus on two IPG nodes
of four processors each than on a single tightly-coupled machine of
eight processors. However, this result is clouded in that asynchronous
message passing was supported over the high-speed link but not within
the single platform. With the goal of drawing more informative
conclusions, in this paper we simulate an unsteady adaptive mesh
application on a wide area network. The number of IPG nodes, the
number of processors per node, and the interconnect speeds are
parameterized to derive conditions under which the IPG would be
suitable for parallel distributed processing of such applications.


Earlier, we proposed two different load balancing ap- 
proaches with an unsteady adaptive mesh as the test case 
application. The first approach, called PLUM [21], is an 
architecture-independent framework which globally parti- 
tions the computational mesh after each adaptation and de- 
termines whether re-balancing the load would lead to re- 
duced total execution time. If an improvement in the load 
balance can be achieved, PLUM utilizes an effective remap- 
ping algorithm to minimize the required data movement. 
Application processing is temporarily suspended during the 
partitioning and data remapping operations. Utilization of a 
parallel graph partitioner like ParMeTiS [17] gives effective 
results. 

The second approach, called Symmetric Broadcast 
Networks (SBN) [7], gives a general-purpose topology- 
independent solution to dynamic load balancing. A salient 
feature of this approach is that it balances processor work- 
loads while the application is running. Therefore, it is able 
to hide the high data migration overhead, albeit at the cost of 
increased interprocessor communication. Results reported 
in [3] indicate that both PLUM and SBN approaches have 
their relative merits, and that they achieve excellent load 
balance with minimal extra overhead. 

Let us summarize the contributions of this paper. We 
propose a novel partitioner, called MinEX, that optimizes 
the two important steps of PLUM (balancing and remap- 
ping) as part of the partitioning process. Instead of attempt- 
ing to only balance the load like most other partitioners, 
the objective of MinEX is to minimize the total runtime of 
the application. This approach counters the possibility that 
perfectly balanced loads can still incur excessive communi- 
cation and redistribution costs while the application is being 
processed. MinEX is also used to experiment with the la- 
tency tolerant techniques on the IPG. Our experimental re- 
sults show that MinEX reduces the number of elements mi- 
grated by PLUM, and also lowers the percentage of edges 
cut by SBN. For example, for 32 partitions with our test 
case, PLUM showed an edge cut of 10.9% and redistributed 
63,270 mesh elements. The corresponding values for the 
SBN-based approach were 36.5% and 19,446. In contrast, 
the MinEX partitioner values were 20.9% and 30,548 re- 
spectively. Thus MinEX attempts to optimize both commu- 
nication and remapping costs, and hence is found to be an 
effective approach to latency hiding in dynamic load bal- 
ancing for grid computing. 

This paper is organized as follows. Section 2 introduces 
the computational application to be tested and determines 
its scalability. Section 3 describes the new MinEX parti- 
tioner. Section 4 describes the experimental study, analyzes 
the obtained results and draws conclusions as to the use of 
the IPG for this and similar applications. Section 5 con- 
cludes the paper. 


2. Test Case Scenario 

Many computational problems are modeled as an unstructured mesh of
vertices and edges. To capture evolving features, the mesh topology is
also frequently adapted. For an efficient parallel implementation,
this leads to dynamic load balancing in the sense that mesh objects
have to be reassigned after each adaptation phase to rebalance the
workload among processors. It is critical to minimize the overhead
associated with remapping data sets, and to reduce the communication
between processors at the next solution step. These goals are
particularly important in the IPG context, where the communication
bandwidth between nodes is likely to be much smaller than that within
a single multiprocessor machine.

The computational mesh considered for our experiments in this paper
simulates an unsteady environment with a strongly time-dependent
adapted region. As depicted in Fig. 1, a shock wave is propagated
through an initial grid to produce the desired effect. The
computational mesh is processed through nine adaptations by moving a
cylindrical volume across the domain with constant velocity. Grid
elements within the cylindrical volume are refined while
previously-refined elements are coarsened in its wake. During the
processing, the size of the mesh increases from 50,000 elements to
1,833,730 elements.



Figure 1. Initial and adapted meshes (after levels 1 and 5) for the
simulated experiment.

Table 1. Scalability analysis of the test application.

To realistically simulate the overhead associated with an adaptive
mesh computation, two weights are associated with each mesh vertex and
one weight with each mesh edge. These weights respectively reflect the
number of time units required for computation, data remapping, and
communication. The total time required to process the vertices
assigned to a processor p must take into account all three metrics, as
defined below.

Processing Weight, $\mathrm{Wgt}^v$, is the computational cost to
process a vertex v.

Redistribution Cost, $\mathrm{Remap}_p^v$, is the overhead to copy the
data set associated with v from p to another processor. This cost
incurred at p includes operations like data packing and initiating
transmission. The redistribution cost incurred by the processor
receiving v is the sum of the communication cost and the operations of
unpacking and merging the data into existing data structures. Clearly,
if the data set for v is already assigned to p, no redistribution cost
is incurred.

Communication Cost, $\mathrm{Comm}_p^v$, is the cost to interact with
all vertices adjacent to v but whose data sets are not local to p.
Thus, if the data sets of all the vertices adjacent to v are also
assigned to p, the communication cost, $\mathrm{Comm}_p^v$, is 0.

We also use six additional metrics which are defined be- 
low. 

Weighted Queue Length, QWgt(p), is the total cost to process the
vertices assigned to p. It is defined as:

$$\mathrm{QWgt}(p) = \sum_{v \text{ assigned to } p} \left( \mathrm{Wgt}^v + \mathrm{Comm}_p^v + \mathrm{Remap}_p^v \right).$$

Total System Load, QWgtTOT, is the sum of QWgt(p) over 
all processors. This metric is used in Section 3.2 to decide 
whether it is appropriate to reassign a vertex from one pro- 
cessor to another. 

Heaviest Load, MaxQWgt, is the maximum value of 
QWgt(p) over all processors, and indicates the total time 
required to process the application. 

Lightest Load, MinQWgt, is the minimum value of
QWgt(p) over all processors, and indicates the workload of
the most lightly-loaded processor.

Average Load, AvgQWgt, is QWgtTOT / P, where P is the
total number of processors.

Load Imbalance Factor, Loadimb, represents the quality 
of the partitioning and is defined as MaxQWgt / AvgQWgt. 
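The six metrics above can be computed directly from per-vertex costs. The following is a minimal Python sketch, not the authors' implementation; the `VertexCost` record, the `load_metrics` helper, and the two-processor example data are hypothetical names and values chosen only for illustration.

```python
from dataclasses import dataclass

@dataclass
class VertexCost:
    wgt: float    # Wgt: computational cost of the vertex
    comm: float   # Comm: cost of interacting with non-local neighbours
    remap: float  # Remap: cost of moving the data set (0 if it stays put)

def qwgt(vertices):
    """Weighted queue length of one processor: sum of Wgt + Comm + Remap."""
    return sum(v.wgt + v.comm + v.remap for v in vertices)

def load_metrics(assignment):
    """assignment maps a processor id to the list of VertexCost it owns."""
    loads = {p: qwgt(vs) for p, vs in assignment.items()}
    total = sum(loads.values())      # QWgtTOT
    heaviest = max(loads.values())   # MaxQWgt
    lightest = min(loads.values())   # MinQWgt
    average = total / len(loads)     # AvgQWgt = QWgtTOT / P
    return {
        "QWgtTOT": total,
        "MaxQWgt": heaviest,
        "MinQWgt": lightest,
        "AvgQWgt": average,
        "Loadimb": heaviest / average,
    }

# Hypothetical two-processor example.
example = {
    0: [VertexCost(4, 1, 0), VertexCost(3, 0, 2)],  # QWgt(0) = 10
    1: [VertexCost(5, 0, 0)],                       # QWgt(1) = 5
}
metrics = load_metrics(example)
```

In this example the heaviest load is 10 against an average of 7.5, so the imbalance factor is about 1.33; a perfectly balanced assignment would give exactly 1.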

Table 1 shows the scalability of our test application 
where P is varied from 2 to 2048. The data was obtained 
by simulating the application (details in Section 4). Each 
column reflects non-dimensionalized MaxQWgt values in 
thousands. The first row of the table assumes that maximum
latency tolerance is achieved, while the second row
assumes that no latency tolerance is achieved. By maximum
latency tolerance, we mean the ability to utilize all avail- 
able processors to overlap communication and redistribu- 
tion costs. Further explanations are provided in Section 3. 
Table 1 shows that this application can scale to over 128 
processors with linear speedup, and therefore is a good can- 
didate for an IPG implementation. 

3. MinEX: A New Partitioner 

Previous studies with this mesh application under PLUM utilized a
variety of general partitioners such as ParMeTiS [17], UAMeTiS [22],
DAMeTiS [22], Jostle-MS [23], and Jostle-MD [23]. Note that UAMeTiS,
DAMeTiS, and Jostle-MD are diffusive schemes designed to modify
existing partitions to produce a processor allocation, whereas
ParMeTiS and Jostle-MS are global partitioners which make no
assumptions about the original mesh distribution. Although all these
partitioners achieve good load balance while minimizing communication
overhead, they fail to consider the cost of moving data between
processors. A unique feature of PLUM is that it addresses this
drawback through the use of an efficient heuristic procedure for
redistributing data to assigned processors.

In the following, we design, implement, and analyze a novel
partitioner, called MinEX, that optimizes computational,
communication, and data remapping costs. We also redefine the
partitioning goal from producing balanced loads to minimizing
MaxQWgt. No direct comparisons with the existing partitioners
mentioned above are possible because MinEX also considers the data
redistribution cost while partitioning the computational mesh.

3.1. Design Principles 

MinEX can be classified as a diffusive multilevel parti- 
tioner. Diffusive algorithms [6] utilize an existing partition 
as a starting point instead of partitioning from scratch. The 
multi-level approach, originally introduced in [16], parti- 
tions the graph in three steps — contraction, partitioning, 
and refinement — each of which is described below. 

Similar to other multilevel partitioners, the first step in MinEX is
to contract the mesh to a reasonable size. However, instead of
repeatedly contracting the mesh in halves, MinEX sequentially
contracts one vertex at a time. The advantage of this approach is that
a decision can be made each time a vertex is later refined as to
whether it should be assigned to another processor. This makes the
algorithm more flexible since the graph does not have to be doubled in
size before this decision can be made. If |V| is the number of
vertices in the mesh, contraction requires O(|V|) steps, which is
asymptotically no larger than that of contracting the mesh
sequentially in halves. Once the mesh is sufficiently small, the
remaining vertices are reassigned according to the partitioning
criteria described in Section 3.2.

MaxQWgt

Clusters
   1      … 312 …
   2      1847  1142   748   467   320     …   305   318   345
   3      2035  1801   674   556   375   331   324   326   382
   4      1868  1516   761   639   412   352   328   371   425
   5      1834  1626   835   767   438   373   359   343   400
   6      2081  1579   898   825   481   391   357   361   427
   7      1884  1279  1032   758   505   383   371   369   414
   8      1944  1451  1102   834   531   434   376   380   435

Loadimb

Clusters
   1      7.05  5.09  1.23  1.11  1.01  1.00  1.00  1.00  1.00
   2      8.54  4.16  2.74  1.81  1.26  1.14  1.04  1.00  1.00
   3      7.15  6.40  2.50  2.11  1.41  1.19  1.05  1.02  1.01
   4      6.63  5.41  2.82  2.40  1.58  1.26  1.07  1.03  1.01
   5      6.53  5.78  3.06  2.83  1.66  1.30  1.11  1.02  1.01
   6      7.31  5.58  3.25  2.99  1.81  1.40  1.08  1.02  1.01
   7      6.68  4.61  3.74  2.80  1.84  1.33  1.10  1.03  1.00
   8      6.90  5.15  3.92  3.05  1.94  1.43  1.13  1.06  1.00

Table 2. Expected runtime and load balance quality for varying ThroTTle values.

The mesh is expanded back to its original size through 
a refinement process. As each vertex is refined, a decision 
is made as to whether or not it should be reassigned. This 
decision employs the same partitioning criteria used by the 
partitioning algorithm in the previous step. Each coarse ver- 
tex reassignment in effect reassigns all of the computational 
vertices that the coarse vertex represents. 

3.2. Partitioning Criteria 

The decision whether a vertex should be reassigned from one processor
to another is based on two metrics: Gain and MinVar. Gain represents
the change in QWgtTOT that would result from a proposed vertex move.
A negative Gain indicates that less total processing is required after
such a vertex reassignment. The partitioning algorithm favors vertex
moves with negative or small Gain values that reduce or minimize
overall system load.

MinVar is computed using the workload (i.e. QWgt(p)) for each
processor p and the smallest load of any processor (MinQWgt) in
accordance with the following formula:

$$\mathrm{MinVar} = \sum_{p} \left( \mathrm{QWgt}(p) - \mathrm{MinQWgt} \right)^2 .$$

Basically, MinVar computes the variance of processor workloads from
that of the most lightly-loaded processor. The objective is to
initiate vertex moves that lower this value. Since processors with
large QWgt(p) values will have large MinVar components, this criterion
tends to move vertices away from processors that have high runtime
requirements. ΔMinVar is the change in the MinVar value after moving a
vertex from one processor to another. A negative value indicates that
MinVar has been reduced.
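As a small illustration of MinVar and of the change in MinVar caused by a move, the Python sketch below works on a plain list of processor loads. It is a simplification under stated assumptions: the function names are hypothetical, and the cost removed from the source and added to the destination are passed in explicitly, whereas in practice they would differ because the move changes communication and redistribution costs.

```python
def min_var(loads):
    """Sum of squared deviations of each QWgt(p) from the lightest load."""
    lightest = min(loads)
    return sum((q - lightest) ** 2 for q in loads)

def delta_min_var(loads, src, dst, cost_removed, cost_added):
    """Change in MinVar if work leaves processor src and lands on dst."""
    after = list(loads)
    after[src] -= cost_removed
    after[dst] += cost_added
    return min_var(after) - min_var(loads)

loads = [10.0, 4.0, 6.0]   # hypothetical QWgt(p) values for three processors
before = min_var(loads)    # (10-4)^2 + (4-4)^2 + (6-4)^2
change = delta_min_var(loads, src=0, dst=1, cost_removed=3.0, cost_added=3.0)
```

Moving three units of work from the heaviest to the lightest processor gives loads of 7, 7, and 6, so MinVar falls from 40 to 2 and the change is negative, which is the direction the partitioner is looking for.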

Let us now describe how the partitioning decisions are made. For each
vertex v, consider all edges to adjacent vertices that are assigned to
other processors. Compute the Gain and ΔMinVar values that would
result from moving v to each of the adjacent processors. The move
involves the adjacent vertex that has the smallest value of Gain, as
long as ΔMinVar < 0 and -Gain/ΔMinVar < ThroTTle, where ThroTTle is a
user-supplied parameter. To increase efficiency, the program utilizes
a minimum heap with pointers to vertex locations to quickly find the
best move and to delete entries directly without searching.
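The acceptance test and the heap ordering described above can be sketched in a few lines of Python using the standard `heapq` module. This is an illustrative sketch only: the candidate tuples, vertex labels, and the `best_moves` helper are hypothetical, and the sketch does not update heap entries after each move as the partitioner is described as doing.

```python
import heapq

def admissible(gain, d_min_var, throttle):
    """Move test from the text: dMinVar < 0 and -Gain / dMinVar < ThroTTle."""
    return d_min_var < 0 and (-gain / d_min_var) < throttle

def best_moves(candidates, throttle):
    """Yield admissible (gain, d_min_var, vertex, processor) moves, lowest Gain first."""
    heap = [c for c in candidates if admissible(c[0], c[1], throttle)]
    heapq.heapify(heap)  # min-heap keyed by Gain, the first tuple element
    while heap:
        yield heapq.heappop(heap)

# Hypothetical candidate moves: (Gain, dMinVar, vertex, target processor).
candidates = [
    (5.0, -1.0, "a", 1),    # ratio 5 < 8, accepted
    (-2.0, -4.0, "b", 2),   # negative Gain, accepted
    (3.0, 2.0, "c", 0),     # MinVar would increase, rejected
    (40.0, -0.5, "d", 3),   # ratio 80 >= 8, rejected
]
chosen = [move[2] for move in best_moves(candidates, throttle=8)]
```

Because ΔMinVar is negative for every admissible move, the ratio -Gain/ΔMinVar is the increase in total load paid per unit of MinVar improvement, which is why a larger ThroTTle admits more expensive balancing moves.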

Conceptually, ThroTTle acts as a gateway that limits increases in Gain
based upon how much of an improvement in MinVar can be achieved.
Table 2 shows how varying ThroTTle values affects the expected
application runtime (MaxQWgt) and load balance quality (Loadimb). The
MaxQWgt entries are non-dimensionalized values in thousands. These
results were obtained by running the experiments described in
Section 4. Table 2 assumes a network of 32 homogeneous processors
distributed over one to eight IPG nodes (clusters). The inter-cluster
interconnect speed is assumed to be a third of the intra-cluster
speed. Results show that a ThroTTle of 64 produces the lowest overall
MaxQWgt, and that larger ThroTTle values improve Loadimb. Experiments
with other network sizes using this same application have shown that
ThroTTle generally converges at values between P and 2P. Note also
that for large values of ThroTTle, better Loadimb does not necessarily
imply lower MaxQWgt.

3.3. Latency Tolerance 

The following steps illustrate how communication and 
data redistribution can be reduced or eliminated. 

Step 1: Initiate send of all data sets to be redistributed. 

Step 2: For each edge (v, w), where the data set for vertex v is local
to processor p and the data set for vertex w is local to another
processor q, initiate send of communication data. The communication
cost metric defined in Section 2 represents the cost of this
communication. Also initiate send of communication data needed by
adjacent processors.

Step 3: Process vertices that are not waiting for incoming 
transmissions. 

Step 4: Receive and unpack any remapped data sets des- 
tined for this processor. 

Step 5: Receive and unpack communication data destined 
for this processor. 

Step 6: Repeat Steps 2 through 5 until all vertices are pro- 
cessed. 

These steps implement a strategy where processors dis- 
tribute data sets and communication data as early as possi- 
ble. The processing of internal vertices can then take place 
while waiting for expected incoming messages. As data sets 
and communication data are received, additional communi- 
cations can be initiated and vertices processed. The most 
optimistic expectation of this strategy is that the process- 
ing activity can entirely hide the data redistribution cost and 
communication latency. At the other extreme, the most pes- 
simistic view is that no latency tolerance is achieved. Exper- 
iments simulating both views to analyze the effect of latency 
tolerance on our test application are described in Section 4. 
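The ordering of Steps 3 through 6 can be illustrated with a small single-processor simulation in Python. This is a sketch under stated assumptions rather than the authors' code: the sends of Steps 1 and 2 are not modeled, message arrivals are given as a fixed list, and the names `process_with_overlap`, `waits_for`, and `arrivals` are hypothetical.

```python
from collections import deque

def process_with_overlap(local_vertices, waits_for, arrivals):
    """Return the order in which one processor handles its vertices.

    waits_for maps a vertex to the set of incoming messages it needs;
    arrivals lists messages in the order they reach this processor.
    """
    pending = {v: set(waits_for.get(v, ())) for v in local_vertices}
    received = set()
    inbox = deque(arrivals)
    order = []
    while pending:
        # Step 3: process every vertex whose incoming data has all arrived.
        ready = [v for v, deps in pending.items() if deps <= received]
        for v in ready:
            order.append(v)
            del pending[v]
        if pending:
            if not inbox:
                raise RuntimeError("vertices are waiting for data that never arrives")
            # Steps 4 and 5: receive and unpack the next incoming message.
            received.add(inbox.popleft())
    return order

order = process_with_overlap(
    local_vertices=[0, 1, 2],
    waits_for={1: {"a"}, 2: {"b"}},   # vertex 0 is internal and waits for nothing
    arrivals=["b", "a"],
)
```

The internal vertex is handled first, and each border vertex is handled as soon as its message arrives, which is the overlap of computation with communication that the strategy relies on.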

3.4. Data Structures 

The following data structures are used by the MinEX 
partitioner to perform its multilevel algorithm: 

• Mesh: The adaptive mesh has the format
{|V|, |E|, vTot, *VMap, *VList, *EList} where
|V| is the number of active vertices in the mesh,
|E| is the number of edges in the mesh,
vTot is the total number of vertices (including merged vertices),
*VMap is a pointer to the list of active vertices,
*VList is a pointer to the complete list of vertices, and
*EList is a pointer to the list of edges.

• VMap: A list of active vertices. None of these vertices
have been compressed through multilevel partitioning.

• VList: A complete list of vertices. Each vertex, v, is defined by a
VList record as
{Wgt, $\mathrm{Remap}_p$, |e|, *e, merge, lookup, *vmap, *heap, border} where
Wgt is the computational cost to process v,
$\mathrm{Remap}_p$ is the redistribution cost to copy the data set
associated with v to another processor from p,
|e| is the number of adjacent edges associated with v,
*e is a pointer to the first edge associated with v (subsequent edges
are stored in contiguous memory locations),
merge is the vertex that was merged with v during a contraction
operation (set to -1 if no merge took place),
lookup is the active vertex that contains v after a series of
contraction operations (set to -1 if no merges took place),
*vmap is a pointer to the position of v in the active vertex table,
*heap is a pointer to the heap entry that relates to vertex v and
represents a potential reassignment of v, and
border is a boolean flag indicating whether v is adjacent to vertices
assigned to other processors.

• EList: A list of edges in the mesh. Each record is defined as
{w, $\mathrm{Comm}_p^{(v,w)}$} where (v, w) is an edge and
$\mathrm{Comm}_p^{(v,w)}$ is the associated communication weight.
Vertex v has an entry in VList and edges are located
using the *e pointer.

• Heap: The heap of potential vertex reassignments.
Each heap record is defined as {Gain, ΔMinVar, v, p},
which specifies the Gain and ΔMinVar that
would result from reassigning vertex v to processor p.
The min-heap is keyed by the Gain value.

• Stack: The stack of compressed vertex pairs, (v1, v2).
These vertices are refined in reverse order
from the order that they were compressed. This graph
contraction technique is described below.
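For readers who think in terms of types, the records above could be mirrored in Python roughly as follows. This is a simplified sketch, not the layout used by MinEX: the class names are hypothetical, each vertex holds its own edge list instead of a pointer into a contiguous EList, and the *vmap and *heap back-pointers are omitted.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class EdgeRecord:
    w: int        # other endpoint of the edge (v, w)
    comm: float   # communication weight of the edge

@dataclass
class VertexRecord:
    wgt: float                      # Wgt: cost to process the vertex
    remap: float                    # Remap: cost to move its data set
    edges: List[EdgeRecord] = field(default_factory=list)  # |e| == len(edges)
    merge: int = -1                 # vertex merged with this one, or -1
    lookup: int = -1                # cached active vertex containing it, or -1
    border: bool = False            # adjacent to a vertex on another processor?

@dataclass(order=True)
class HeapRecord:
    gain: float                     # compared first, so a min-heap is keyed by Gain
    d_min_var: float
    v: int
    p: int

@dataclass
class Mesh:
    vlist: List[VertexRecord] = field(default_factory=list)     # all vertices
    vmap: List[int] = field(default_factory=list)                # active vertex ids
    stack: List[Tuple[int, int]] = field(default_factory=list)  # merged pairs

    def add_vertex(self, wgt: float, remap: float) -> int:
        self.vlist.append(VertexRecord(wgt, remap))
        vid = len(self.vlist) - 1
        self.vmap.append(vid)
        return vid

    def add_edge(self, v: int, w: int, comm: float) -> None:
        # Store the edge on both endpoints so each vertex can list its neighbours.
        self.vlist[v].edges.append(EdgeRecord(w, comm))
        self.vlist[w].edges.append(EdgeRecord(v, comm))

mesh = Mesh()
a = mesh.add_vertex(wgt=4.0, remap=2.0)
b = mesh.add_vertex(wgt=3.0, remap=1.0)
mesh.add_edge(a, b, comm=5.0)
```

Here |V| corresponds to `len(mesh.vmap)` and vTot to `len(mesh.vlist)`; the two differ only once contraction has created merged vertices.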

3.5. Graph Contraction 

The partitioner selects sets of randomly chosen pairs of vertices that
are assigned to the same processor p. From this set, the vertex pair
(v, w) that has the largest
$\mathrm{Comm}_p^{(v,w)} / (\mathrm{Remap}_p^v + \mathrm{Remap}_p^w)$
value is merged. This formula attempts to find edges with large
communication costs while minimizing the potential data redistribution
overhead. The motivation behind this strategy is to arrive at a
contracted mesh with a small edge cut and a small data distribution
cost.
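The pair selection can be sketched in Python as below. The sketch assumes that the score of a pair is the communication weight of the edge joining it divided by the sum of the two redistribution costs, that candidate pairs are adjacent vertices, and that redistribution costs are positive; the function name, the dictionaries, and the sample size are hypothetical.

```python
import random

def pick_pair_to_merge(edges, owner, remap, sample_size, rng):
    """Pick, from a random sample of same-processor edges, the best pair to merge.

    edges maps (v, w) to the communication weight of that edge,
    owner maps a vertex to its processor, remap maps a vertex to its Remap cost.
    """
    same_proc = [e for e in edges if owner[e[0]] == owner[e[1]]]
    if not same_proc:
        return None
    sample = rng.sample(same_proc, min(sample_size, len(same_proc)))
    # Favour heavy edges whose endpoints are cheap to redistribute.
    return max(sample, key=lambda e: edges[e] / (remap[e[0]] + remap[e[1]]))

edges = {(0, 1): 6.0, (1, 2): 9.0, (2, 3): 8.0}
owner = {0: 0, 1: 0, 2: 0, 3: 1}
remap = {0: 1.0, 1: 2.0, 2: 4.0, 3: 1.0}
pair = pick_pair_to_merge(edges, owner, remap, sample_size=10, rng=random.Random(0))
```

In the example, edge (2, 3) is skipped because its endpoints sit on different processors, and (0, 1) wins with a score of 2.0 against 1.5 for (1, 2), even though (1, 2) has the heavier edge.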

To contract a vertex v, a merged vertex record, M, is
created and the edge (v, w) is collapsed. The edges of M
are generated by utilizing the edge lists of vertices v and w.
VMap is adjusted to contain M and to remove v and w; |V|
is decremented and vTot is incremented; |E| is increased
by the number of edges created for M; and the pair (v, w)
is pushed onto Stack.

This contraction procedure is implemented using a set 
union/find algorithm so that edges of existing vertices can 
remain unchanged. For example, if an existing vertex is 
adjacent to v, accesses to its EList record will check 
whether v has been merged. If it has, lookup will be ac- 
cessed to quickly find the appropriate merged vertex. If 
lookup is not current (i.e., lookup > vTot), the union/find 
algorithm will search the chain of vertices beginning with 
merge in order to update the lookup value, so that subsequent
lookups can be done efficiently. Pseudocode describing
the union/find procedure is given in Fig. 2.


Procedure Find (v)

    If (merge = -1) Return v
    If (lookup != -1) And (lookup <= vTot)
        Then Return lookup = Find (lookup)
        Else Return lookup = Find (merge)


Figure 2. The union/find algorithm. 
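A runnable Python rendering of the procedure in Fig. 2 is sketched below. It is an adaptation under stated assumptions: `merge` and `lookup` are plain lists indexed by vertex id, merged vertices receive the next free id, and ids are 0-based, so a cached lookup is treated as current when it is strictly less than `v_tot` where the figure writes `<=`.

```python
def find(v, merge, lookup, v_tot):
    """Return the active vertex that currently contains v, caching it in lookup."""
    if merge[v] == -1:
        return v                      # v itself is active
    cached = lookup[v]
    if cached != -1 and cached < v_tot:
        # The cache is current: continue from it (it may itself have been merged).
        lookup[v] = find(cached, merge, lookup, v_tot)
    else:
        # The cache is missing or stale: walk the merge chain instead.
        lookup[v] = find(merge[v], merge, lookup, v_tot)
    return lookup[v]

# Vertices 0 and 1 were merged into 3, then 3 and 2 were merged into 4.
merge = [3, 3, 4, 4, -1]
lookup = [-1, -1, -1, -1, -1]
top = find(0, merge, lookup, v_tot=5)

# Undo the last merge: vertex 4 disappears, 2 and 3 become active again.
merge[2] = merge[3] = -1
after_expand = find(0, merge, lookup, v_tot=4)
```

After the first call the cached value for vertex 0 points straight at the top-level merged vertex, so repeated queries avoid walking the chain. After the expansion that cached id is no longer valid, and the search falls back to the merge chain and refreshes the cache.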

3.6. Partitioning the Contracted Graph 

Once the graph contraction process is complete, the par- 
titioning can be performed. Because the number of vertices 
is greatly reduced, the MinEX algorithm can execute very 
efficiently. The algorithm considers every remaining vertex 
of the mesh to find potential reassignments that will reduce 
Gain and MinVar as described in Section 3.2. All poten- 
tial vertex reassignments are added to the min-heap. Actual 
reassignments are executed in heap order. As a reassign- 
ment is executed, the heap is adjusted to reflect the new 
partition status. 

3.7. Graph Expansion 

The graph is restored to its original size by expanding pairs of
vertices in the reverse of the order in which they were merged. The
Stack data structure controls the order. As pairs of vertices, (v, w),
are refined, merged edges and vertices are deallocated. The merge and
lookup vertex numbers are also adjusted in the vertex table. The VMap
table is updated to delete the merged vertex, M, and to add v and w;
|V| is incremented and vTot is decremented; and |E| is decreased by
the number of edges created for M. After each refinement, a decision
is made as to whether a partition can be improved by reassigning v or
w. When reassignments are made, adjacent border vertices are also
considered.
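The counter bookkeeping for contraction and expansion can be summarized in a short Python sketch. It tracks only |V|, vTot, and the stack of merged pairs; edge handling and reassignment decisions are left out, the class name is hypothetical, and assigning a merged vertex the next free id is an assumption made for illustration.

```python
class ContractionLog:
    """Tracks |V|, vTot, and the merge stack across contraction and expansion."""

    def __init__(self, n_vertices):
        self.n_active = n_vertices   # |V|: number of active vertices
        self.v_tot = n_vertices      # vTot: all vertices, including merged ones
        self.stack = []              # merged pairs, most recent last

    def contract(self, v, w):
        merged_id = self.v_tot       # assumed id of the new merged vertex M
        self.v_tot += 1
        self.n_active -= 1           # v and w leave the active set, M joins it
        self.stack.append((v, w))
        return merged_id

    def expand(self):
        v, w = self.stack.pop()      # most recently merged pair comes back first
        self.v_tot -= 1
        self.n_active += 1           # M leaves the active set, v and w rejoin it
        return v, w

log = ContractionLog(4)
m1 = log.contract(0, 1)              # creates vertex 4
m2 = log.contract(m1, 2)             # creates vertex 5
counts_when_contracted = (log.n_active, log.v_tot)
first = log.expand()
second = log.expand()
```

Expansion returns the pairs in last-in, first-out order, so the coarsest merge is undone first and both counters return to their original values once the stack is empty.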

4. Performance Results 

The MinEX partitioner was executed with actual application data to
simulate an adaptive mesh computation for a variety of system
configurations. Individual runs model networks with a particular
number of processors P, number of IPG nodes/clusters C, ThroTTle
values, and interconnect speeds I. In our experiments, P was varied
from 2 to 2048, C was varied from 1 to 8, ThroTTle was varied to find
the optimal value for minimizing runtime, and I was varied to simulate
both high-speed cluster interconnects and low-speed wide area network
connections.

Based on performance studies reported in [12, 20], typical
communication latency and bandwidth slowdown from integrated clusters
to configurations connected through a high-speed interconnect are in
the range of 3 to 100. Wide area network connections are 1,000 to
10,000 times slower than the internal intra-connects of a single
cluster. In our experiments, we have assumed that the intra-cluster
communication speed is normalized to a value of 1. Simulations of
inter-cluster communication assumed slowdown factors of 3, 10, 100,
and 1,000. To simplify the analysis, we have assumed that individual
processors are homogeneous and divided as evenly as possible among the
clusters.

Table 3 shows results of experimental runs analyzing the effect of
varying numbers of clusters and interconnect speeds, assuming P = 32
homogeneous processors. The interconnect speeds indicate the slowdown
factor relative to the intra-cluster communication speed. To be
consistent with Tables 1 and 2, runtimes are shown as
non-dimensionalized values in thousands. Table 3(a) charts the
experimental results when no latency tolerance is achieved, while
Table 3(b) assumes maximum latency tolerance. The following
conclusions can be drawn from the experiments.

As the interconnect speed is reduced, the slowdown experienced by
utilizing additional clusters increases dramatically. For example, the
runtime metric in Table 3(a) is 4,102 when two clusters and an
interconnect slowdown of 1000 are assumed; however, the metric is
93,566 when eight clusters are assumed. Thus, performance deteriorates
by a factor of about 22.8. With an interconnect slowdown of 3, the
corresponding performance degradation is only a factor of 1.3. The
same pattern holds true in Table 3(b).
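The two factors quoted above follow directly from the Table 3(a) entries for two and eight clusters, at interconnect slowdowns of 1000 and 3 respectively:

```latex
\[
\frac{93{,}566}{4{,}102} \approx 22.8,
\qquad
\frac{968}{728} \approx 1.33 .
\]
```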




               Interconnect Speeds
Clusters       3      10     100    1000
   1         473     473     473     473
   2         728     863    1228    4102
   3         755    1168    2783   18512
   4         791    1361    3667   25040
   5         854    1649    5677   53912
   6         915    1717    8521   76169
   7         956    1915   10958   80568
   8         968    2178   11492   93566

(a) No latency tolerance

               Interconnect Speeds
Clusters       3      10     100    1000
   1         287     287     287     287
   2         298     469     763    3941
   3         322     548    2386   12705
   4         328     680    3297   21888
   5         336     768    4369   33092
   6         345     856    5044   52668
   7         352     893    5480   61079
   8         357    1048    5721   61321

(b) Maximum latency tolerance

Table 3. Expected runtime for varying cluster sizes (P = 32) and interconnect speeds.


For the mesh application considered, Globus over low-speed networks
such as the Internet is not a viable approach assuming current
technology. In fact, the interconnection speed must improve by at
least an order of magnitude before this approach could be useful. At
present, applications would need to have little runtime communication
and data set remapping for low-speed wide area networks to be
practical interconnects.

We can compare the effectiveness of latency tolerant algorithms to
those without latency tolerance by measuring the runtime of each
approach as the number of clusters and the interconnect speeds are
varied. The performance improvements using latency tolerance increase
as the number of clusters increases. This can be verified by comparing
the same rows from Tables 3(a) and 3(b). For example, consider the
results with eight clusters. The runtime improvements comparing
latency tolerant algorithms to those with no latency tolerance are
factors of 2.7, 2.1, 2.0, and 1.5, respectively, for interconnect
slowdowns of 3, 10, 100, and 1000. In contrast, results with two
clusters indicate gains of 2.4, 1.8, 1.6, and 1.0, respectively, for
the same interconnect slowdowns. These results indicate that utilizing
more clusters gives greater runtime improvement when employing latency
tolerance.

The same is also true when the interconnect slowdowns are varied (this
can be analyzed by comparing the corresponding table columns). For
example, with an interconnect slowdown of 1000, the improvements in
runtime by utilizing latency tolerance are 1.6, 1.0, 1.5, 1.1, 1.6,
1.4, 1.3, and 1.5, respectively, for one to eight clusters. On the
other hand, with an interconnect slowdown of 10, the corresponding
improvements are 1.6, 1.8, 2.1, 2.0, 2.1, 2.0, 2.1, and 2.1. In this
case, results surprisingly demonstrate that latency tolerance has a
bigger payoff when interconnect slowdowns are smaller. Additional
investigations are required to verify or counter this observation.
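The improvement factors quoted in this discussion can be recomputed from Table 3 by dividing each runtime without latency tolerance by the corresponding runtime with maximum latency tolerance. The Python sketch below does this with the table values; the dictionary layout and the `improvement` helper are illustrative choices, not part of the original study.

```python
SLOWDOWNS = (3, 10, 100, 1000)

# Table 3(a): no latency tolerance, indexed by number of clusters.
NO_TOLERANCE = {
    1: (473, 473, 473, 473),
    2: (728, 863, 1228, 4102),
    3: (755, 1168, 2783, 18512),
    4: (791, 1361, 3667, 25040),
    5: (854, 1649, 5677, 53912),
    6: (915, 1717, 8521, 76169),
    7: (956, 1915, 10958, 80568),
    8: (968, 2178, 11492, 93566),
}

# Table 3(b): maximum latency tolerance.
MAX_TOLERANCE = {
    1: (287, 287, 287, 287),
    2: (298, 469, 763, 3941),
    3: (322, 548, 2386, 12705),
    4: (328, 680, 3297, 21888),
    5: (336, 768, 4369, 33092),
    6: (345, 856, 5044, 52668),
    7: (352, 893, 5480, 61079),
    8: (357, 1048, 5721, 61321),
}

def improvement(clusters, slowdown):
    """Runtime without latency tolerance divided by runtime with it."""
    col = SLOWDOWNS.index(slowdown)
    return NO_TOLERANCE[clusters][col] / MAX_TOLERANCE[clusters][col]

eight_clusters = [round(improvement(8, s), 1) for s in SLOWDOWNS]
two_clusters = [round(improvement(2, s), 1) for s in SLOWDOWNS]
slowdown_1000 = [round(improvement(c, 1000), 1) for c in range(1, 9)]
slowdown_10 = [round(improvement(c, 10), 1) for c in range(1, 9)]
```

Rounded to one decimal place, the four lists reproduce the factors given in the text for eight clusters, two clusters, a slowdown of 1000, and a slowdown of 10.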

For our test application, Globus could be a viable approach if a
high-speed interconnect (slowdown factor between 3 and 10) between
clusters is utilized. Results in Tables 3(a) and 3(b) comparing one
and eight clusters with an interconnect slowdown of 3 show runtime
deterioration factors of 2.04 and 1.24, respectively. Similar
comparisons for an interconnect slowdown of 10 show deterioration
factors of 4.60 and 3.65, respectively. These factors, being smaller
than the number of clusters, indicate a relative speedup when the
number of clusters increases.

5. Conclusions 

We presented a latency-tolerant partitioner, called MinEX, that not
only balances processor workloads but also minimizes data movement and
runtime communication for adaptive mesh applications that are executed
in a parallel distributed fashion on the IPG. Planned future
experiments will test MinEX performance in the context of different
application classes and devise metrics to compare it with other
popular partitioning schemes. We also analyzed the conditions that are
required for the IPG to be an effective tool for such distributed
computations. Our results demonstrated that MinEX is a viable load
balancer provided the IPG nodes are connected by a high-speed
asynchronous interconnection network. We are currently implementing a
parallel version of MinEX. An area of further research is the
mathematical analysis of latency tolerance and performance slowdowns
based on the interconnect speed, the number of clusters employed, and
the topology of the mesh.